Introduction

Description of the dataset:

Forced Expiratory Volume (FEV) is an index of pulmonary function that measures the volume of air expelled after one second of constant effort. The data contain determinations of FEV on 654 children ages 6 – 22 who were seen in a childhood respiratory disease study in 1980 in East Boston, Massachusetts. The data are part of a larger study to follow the change in pulmonary function over time in children.

Variables in the Dataset:
Variable  Type         Description
ID        Numeric      Unique row identifier
Age       Numeric      Age of the child
Height    Numeric      Height of the child in inches
Sex       Categorical  Sex of the child
Smoker    Categorical  Whether the child is a non-smoker or current smoker
FEV       Numeric      FEV of the child in litres
Background information of the dataset:

The data are part of a larger study to follow the change in pulmonary function over time in children; the 654 records used here come from the 1980 East Boston childhood respiratory disease study.

Note: no citation is required for this source (http://www.statsci.org/data/general/fev.html)

Why is it interesting:

We chose this dataset out of personal interest: first, to see whether a child’s FEV can be predicted from the available predictors, rather than performing a full Pulmonary Function Test to identify pulmonary disease in a child; and second, to analyse statistically which predictors are responsible for increasing or decreasing a child’s pulmonary function.

Why are we creating a model of this Data:

To predict a child’s Forced Expiratory Volume (FEV) based on the child’s Age, Height, Sex and whether the child is a smoker or not. This could help kick-start treatment based on the predicted FEV reading, rather than going through a full Pulmonary Function Test.

Goal of the Model:

The goal is to find, after trying several methods, the model that predicts a child’s FEV with minimal error. The best model is the one with the lowest Root Mean Square Error (RMSE) under Leave-One-Out Cross-Validation (LOOCV).
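For a linear model, the LOOCV RMSE can be computed without refitting n times, using the PRESS/hat-value shortcut. Below is a minimal sketch of such a helper (`rmse_loocv` is a hypothetical function name, not part of this report's code), illustrated on the built-in mtcars data since the FEV data are loaded later in the Methods section:

```r
# LOOCV RMSE for a linear model via the PRESS shortcut:
# the leave-one-out residual is resid_i / (1 - h_i), where h_i is the hat value.
rmse_loocv = function(model) {
  sqrt(mean((resid(model) / (1 - hatvalues(model)))^2))
}

# Illustration on a built-in dataset (mtcars stands in for the FEV data here):
fit = lm(mpg ~ wt + hp, data = mtcars)
rmse_loocv(fit)
```

The same function applies unchanged to each candidate FEV model considered below.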

Methods

We carry out several analyses and methods in this section; each subsection describes its part of the task.

Load FEV data, Observations:

childfev = read.csv("http://www.statsci.org/data/general/fev.txt",
                    sep = "\t",
                    quote = "\"",
                    comment.char = "",
                    stringsAsFactors = FALSE)
childfev$Sex = as.factor(childfev$Sex)
childfev$Smoker = as.factor(childfev$Smoker)
str(childfev)
## 'data.frame':    654 obs. of  6 variables:
##  $ ID    : int  301 451 501 642 901 1701 1752 1753 1901 1951 ...
##  $ Age   : int  9 8 7 9 9 8 6 6 8 9 ...
##  $ FEV   : num  1.71 1.72 1.72 1.56 1.9 ...
##  $ Height: num  57 67.5 54.5 53 57 61 58 56 58.5 60 ...
##  $ Sex   : Factor w/ 2 levels "Female","Male": 1 1 1 2 2 1 1 1 1 1 ...
##  $ Smoker: Factor w/ 2 levels "Current","Non": 2 2 2 2 2 2 2 2 2 2 ...
dim(childfev)
## [1] 654   6
head(childfev$FEV,10)
##  [1] 1.708 1.724 1.720 1.558 1.895 2.336 1.919 1.415 1.987 1.942

From the structure above we observe that Age and Height are numeric variables, Sex and Smoker are categorical variables, and FEV is the numeric response variable.

Pairs plot of the FEV data
pairs(childfev[c('Age','FEV','Height','Sex','Smoker')], col = "darkgrey")

The pairs plot shows clear signs of a linear relationship between FEV and Height. There is no equally clear linear relationship between FEV and Age. The two categorical variables, Sex and Smoker, each show two distinct groups. Let’s explore this.

Let’s check the correlation matrix to explore the correlation among the variables.

cor(childfev[c('Age','Height','FEV')])
##           Age Height    FEV
## Age    1.0000 0.7919 0.7565
## Height 0.7919 1.0000 0.8681
## FEV    0.7565 0.8681 1.0000

The correlation matrix confirms what the pairs plot suggests: Age and Height are both highly correlated with the FEV response, as well as with each other. We will need to examine the variance inflation factor (VIF) when building our models.
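As a preview of that VIF check, here is a base-R sketch (no extra packages; `vif_manual` is a hypothetical helper, not part of this report's code) that regresses each column on the others and computes VIF_j = 1 / (1 - R²_j); values above roughly 5–10 flag collinearity. It is illustrated on the built-in mtcars data:

```r
# Manual VIF: regress each column on the others; VIF_j = 1 / (1 - R^2_j).
# Intended use on the FEV data: vif_manual(childfev[c('Age', 'Height')])
vif_manual = function(df) {
  sapply(names(df), function(v) {
    r2 = summary(lm(reformulate(setdiff(names(df), v), v), data = df))$r.squared
    1 / (1 - r2)
  })
}

# Illustration on a built-in dataset:
vif_manual(mtcars[c("wt", "hp", "disp")])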

Categorical Data of Sex and Smoker
table(childfev$Sex)
## 
## Female   Male 
##    318    336
table(childfev$Smoker)
## 
## Current     Non 
##      65     589

Model 1 - Multiple Linear Regression (MLR) FEV data:

Let’s start with a Multiple Linear Regression (MLR) on all the predictors (except ID), with FEV as the response.

mlr_model = lm(FEV ~ Age + Height + Sex + Smoker, data = childfev)

This model contains all the predictors, including dummy variables for the categorical predictors (Sex and Smoker).

Model 2 - Polynomial Transformation of Numeric Predictor

Let’s explore polynomial transformations of the numeric predictors to see whether we can find a model that lowers the RMSE LOOCV.

poly2_model = lm(FEV ~ Sex + Smoker + Age + Height + I(Age^2) + I(Height^2) , data = childfev)
poly3_model = lm(FEV ~ Sex + Smoker + Age + Height + I(Age^2) + I(Height^2) + I(Age^3) + I(Height^3) , data = childfev)
poly4_model = lm(FEV ~ Sex + Smoker + Age + Height + I(Age^2) + I(Height^2) + I(Age^3) + I(Height^3) + I(Age^4) + I(Height^4), data = childfev)
poly5_model = lm(FEV ~ Sex + Smoker + Age + Height + I(Age^2) + I(Height^2) + I(Age^3) + I(Height^3) + I(Age^4) + I(Height^4) + I(Age^5) + I(Height^5), data = childfev)
poly6_model = lm(FEV ~ Sex + Smoker + Age + Height + I(Age^2) + I(Height^2) + I(Age^3) + I(Height^3) + I(Age^4) + I(Height^4) + I(Age^5) + I(Height^5) + I(Age^6) + I(Height^6), data = childfev)

Model 3 - Interaction between Predictors:

The following explorations are carried out for interactions between predictors:
- Interaction between each numeric predictor (Age, Height) and each categorical predictor (Sex, Smoker), one at a time
- Interaction between the numeric predictors (Age, Height) themselves
- Interaction between all of them, 2-way and 3-way
- ANOVA F tests, RMSE LOOCV and average percentage error to determine the best model

model_age_sex = lm(FEV ~ (Age + Sex) ^ 2 , data = childfev)
model_age_smoker = lm(FEV ~ (Age + Smoker) ^ 2 , data = childfev)
model_age_sex_smoker = lm(FEV ~ Age + Sex + Smoker + Age:Sex + Age:Smoker , data = childfev)
model_height_sex = lm(FEV ~ (Height + Sex) ^ 2 , data = childfev)
model_height_smoker = lm(FEV ~ (Height + Smoker) ^ 2 , data = childfev)
model_height_sex_smoker = lm(FEV ~ Height + Sex + Smoker + Height:Sex + Height:Smoker , data = childfev)
model_all_int_except_age_height = lm(FEV ~ Age + Height + Sex + Smoker + Age:Sex + Age:Smoker + Height:Sex + Height:Smoker, data = childfev) 
model_all_int_with_age_height = lm(FEV ~ Age + Height + Sex + Smoker + Age:Sex + Age:Smoker + Height:Sex + Height:Smoker + Age:Height, data = childfev) 
model_all_2way = lm(FEV ~ (Age + Height + Sex + Smoker) ^ 2, data = childfev) 
model_all_3way = lm(FEV ~ (Age + Height + Sex + Smoker) ^ 3, data = childfev)

Model 4 - Creation of Big Model (Polynomial + Interaction), Stepwise via AIC / BIC:

Let’s create a big model with polynomials of degree 3 and 3-way interactions (categorical-to-categorical, categorical-to-numeric, numeric-to-numeric) and examine its score.

We also create a few models combining the previous (good) models:
- Polynomial degree 2 + Interaction Model 8
- Polynomial degree 2 + Interaction Model 9 (2-way interaction)
- Polynomial degree 2 + Interaction Model 10 (3-way interaction)

Big Model and 3 other models:
big_model = lm(FEV ~ (Age + Height + Sex + Smoker) ^ 3 + I(Age^2) + I(Height^2) + I(Age^3) + I(Height^3), data = childfev) # Big Model - Polynomial degree 3 + 3 Way Interaction
poly2_int_model8 = lm(FEV ~ Age + Height + Sex + Smoker + Age:Sex + Age:Smoker + Height:Sex + Height:Smoker + Age:Height , data = childfev) # Polynomial degree 2 + All Interaction Except Sex:Smoker
poly2_int_model9 = lm(FEV ~ (Age + Height + Sex + Smoker) ^ 2, data = childfev) # Polynomial degree 2 + 2 Way Interaction
poly2_int_model10 = lm(FEV ~ (Age + Height + Sex + Smoker) ^ 3, data = childfev) # Polynomial degree 2 + 3 Way Interaction
AIC and BIC of the Big Model:

Let’s run backward stepwise selection under AIC and BIC to see the outcome, and afterwards perform the ANOVA F tests.

n = length(resid(big_model))
big_model_aic = step(big_model, direction = "backward", trace = 0)
big_model_bic = step(big_model, direction = "backward", k = log(n), trace = 0)

Let’s compare with the good models from the interaction and polynomial regression models to see which one is significant.

Results

Let’s evaluate the results derived from the categories of models above.

Model 1,2 - Multiple Linear Regression (MLR) + Log Transformation + Polynomial - Results:

As the polynomial degree increases, the RSS and RMSE decrease; this is expected, since a more complex model fits the training data with smaller error. However, the RMSE LOOCV increases after polynomial degree 2. Hence it does not seem worthwhile to increase the degree beyond 2, as the Leave-One-Out Cross-Validation RMSE increases.

Let’s see what happens to R² and adjusted R² as the polynomial degree increases, and what the average percentage error is for each degree.

R² keeps increasing with the polynomial degree, but adjusted R² stays mostly flat at and beyond degree 2. This gives fair evidence in favour of the model with polynomial degree 2.

The third plot gives the average percentage error, calculated from the formula

\[ \frac{1}{n}\sum_i\frac{|\text{predicted}_i - \text{actual}_i|}{\text{predicted}_i} \times 100 \]

It shows that the percentage error does not decrease after polynomial degree 2, but instead increases. This, too, points towards the polynomial regression model of degree 2.

Let’s run an ANOVA F test to compare these models and see whether the polynomial degree is significant.
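The nested-model F test uses R's `anova()` on a smaller and a larger model. A minimal sketch on the built-in mtcars data (with this report's models, the call would be, e.g., `anova(mlr_model, poly2_model)`):

```r
# ANOVA F test between nested linear models: does the quadratic term help?
m_small = lm(mpg ~ wt, data = mtcars)
m_large = lm(mpg ~ wt + I(wt^2), data = mtcars)
p_value = anova(m_small, m_large)$`Pr(>F)`[2]
p_value  # small p-value => the added term is significant
```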

Anova F Test

The first point shows the p-value from the ANOVA F test between polynomial regression of degree 1 and degree 2; the remaining points compare degree 2 against degrees 3/4/5/6. At alpha = 0.05, for every degree above 2, we fail to reject the null hypothesis that the added parameters are zero.

Let’s compare the actual figures across all the models, to confirm that the polynomial regression model of degree 2 is the best.

ID Model rss rmse rmse_loocv r2 adj_r2
1 MLR 110.28 10.501 0.4145 0.7754 0.7740
2 Polynomial Degree 2 100.99 10.049 0.3980 0.7943 0.7924
3 Polynomial Degree 3 100.59 10.029 0.3984 0.7951 0.7926
4 Polynomial Degree 4 100.50 10.025 0.3996 0.7953 0.7921
5 Polynomial Degree 5 100.21 10.010 0.3989 0.7959 0.7921
6 Polynomial Degree 6 99.47 9.973 0.4012 0.7974 0.7929

Even though the ANOVA F test suggests Model 10 (3-way) is significant, its RMSE LOOCV is only slightly smaller than Model 8’s. We will keep these two models (Model 8 and Model 10) as the best models and explore them further.

This gives further confidence that polynomial degree 2 is the best model, based on the following criteria we tested:
- Lowest RMSE LOOCV
- Highest R² and adjusted R²
- Lowest average percentage error
- ANOVA F test p-values

Model 3 - Interaction Model - Results:

The interaction of Age with the categorical variables (Sex and Smoker) changes the estimates more drastically than the interaction of Height with the categorical variables (Sex and Smoker).

Let’s conduct ANOVA F tests to verify which interactions are significant and which are not.

Anova F Test
ID null_Model full_Model p_value decision
1 Age+Sex+Age:Sex Age+Sex+Smoker+Age:Sex+Age:Smoker 0.0000 Age+Sex+Smoker+Age:Sex+Age:Smoker
2 Age+Smoker+Age:Smoker Age+Sex+Smoker+Age:Sex+Age:Smoker 0.0000 Age+Sex+Smoker+Age:Sex+Age:Smoker
3 Height+Sex+Height:Sex Height+Sex+Smoker+Height:Sex+Height:Smoker 0.4218 Height+Sex+Height:Sex
4 Height+Smoker+Height:Smoker Height+Sex+Smoker+Height:Sex+Height:Smoker 0.0000 Height+Sex+Smoker+Height:Sex+Height:Smoker
5 Age+Sex+Smoker+Age:Sex+Age:Smoker Age+Height+Age:Sex+Age:Smoker+Height:Sex+Height:Smoker 0.0000 Age+Height+Age:Sex+Age:Smoker+Height:Sex+Height:Smoker
6 Height+Sex+Height:Sex Age+Height+Age:Sex+Age:Smoker+Height:Sex+Height:Smoker 0.0000 Age+Height+Age:Sex+Age:Smoker+Height:Sex+Height:Smoker
7 Height+Sex+Smoker+Height:Sex+Height:Smoker Age+Height+Age:Sex+Age:Smoker+Height:Sex+Height:Smoker 0.0000 Age+Height+Age:Sex+Age:Smoker+Height:Sex+Height:Smoker
8 Age+Height+Age:Sex+Age:Smoker+Height:Sex+Height:Smoker Age+Height+Age:Sex+Age:Smoker+Height:Sex+Height:Smoker+Age:Height 0.0000 Age+Height+Age:Sex+Age:Smoker+Height:Sex+Height:Smoker+Age:Height
9 Age+Height+Age:Sex+Age:Smoker+Height:Sex+Height:Smoker+Age:Height 2 way 0.8041 Age+Height+Age:Sex+Age:Smoker+Height:Sex+Height:Smoker+Age:Height
10 Age+Height+Age:Sex+Age:Smoker+Height:Sex+Height:Smoker+Age:Height 3 way 0.0000 3way

The 3-way interaction appears significant among all the interactions between the numeric and categorical variables.

Let’s see how these models perform on RSS, RMSE and RMSE LOOCV.

As more interactions are added, the RSS and RMSE decrease, as expected for increasingly complex models. The RMSE LOOCV also decreases with added interactions, but increases slightly after Model 9.

Let’s see what happens to R² and adjusted R² as interactions are added, and what the average percentage error is for each model.

R² and adjusted R² keep increasing as interactions are added, and the error rate decreases as well. After Model 9, however, the error rate increases slightly.

Let’s compare the actual figures across all the models to see which model to finalize.

ID Interaction_Model rss rmse rmse_loocv r2 adj_r2
1 Age+Sex+Age:Sex 175.51 13.248 0.5214 0.6425 0.6408
2 Age+Smoker+Age:Smoker 199.27 14.116 0.5566 0.5941 0.5922
3 Age+Sex+Smoker+Age:Sex+Age:Smoker 165.53 12.866 0.5085 0.6628 0.6602
4 Height+Sex+Height:Sex 114.88 10.718 0.4218 0.7660 0.7649
5 Height+Smoker+Height:Smoker 120.32 10.969 0.4324 0.7549 0.7538
6 Height+Sex+Smoker+Height:Sex+Height:Smoker 114.58 10.704 0.4231 0.7666 0.7648
7 Age+Height+Sex+Smoker+Age:Sex+Age:Smoker+Height:Sex+Height:Smoker 104.29 10.212 0.4059 0.7876 0.7849
8 Age+Height+Sex+Smoker+Age:Sex+Age:Smoker+Height:Sex+Height:Smoker+Age:Height 98.15 9.907 0.3948 0.8001 0.7973
9 2 Way 98.14 9.907 0.3964 0.8001 0.7970
10 3 Way 93.89 9.690 0.3907 0.8087 0.8046

The best interaction models (Models 8 and 10) achieve RMSE LOOCV values comparable to, or slightly below, that of the degree-2 polynomial model; we will carry these candidates forward.

Model 4 - Big Model, AIC, BIC - Results:

ANOVA F Test of the AIC/BIC

ID null_Model full_Model p_value decision
1 Polynomial Degree 2 Big Model 0.0000 Big Model
2 All Interaction Except Sex:Smoker Big Model 0.0001 Big Model
3 2 way Big Model 0.0000 Big Model
4 3 way Big Model 0.2198 3 way
5 Polynomial Degree 2 + All Interaction Except Sex:Smoker Big Model 0.0001 Big Model
6 Polynomial Degree 2 + 2 way Big Model 0.0000 Big Model
7 Polynomial Degree 2 + 3 way Big Model 0.2198 Polynomial Degree 2 + 3 way
8 Big Model - AIC Big Model 0.7976 Big Model - AIC
9 Big Model - BIC Big Model 0.3877 Big Model - BIC
10 AIC - Polynomial Degree 2 + All Interaction Except Sex:Smoker Big Model 0.0003 Big Model
11 BIC - Polynomial Degree 2 + All Interaction Except Sex:Smoker Big Model 0.0000 Big Model
12 AIC - Polynomial Degree 2 + 2 way Big Model 0.0003 Big Model
13 BIC- Polynomial Degree 2 + 2 way Big Model 0.0000 Big Model
14 AIC - Polynomial Degree 2 + 3 way Big Model 0.3877 AIC - Polynomial Degree 2 + 3 way
15 BIC - Polynomial Degree 2 + 3 way Big Model 0.3877 BIC - Polynomial Degree 2 + 3 way

From the table above, the AIC/BIC models of the Big Model and of Polynomial Degree 2 + 3-way interaction are significant.

Let’s compare these models on RSS, RMSE, RMSE LOOCV, R², adjusted R² and average error rate.

As more interactions are added, the RSS and RMSE decrease, as expected for increasingly complex models. The RMSE LOOCV also decreases with added interactions, but increases slightly after Model 9.

Let’s see what happens to R² and adjusted R² as model complexity increases, and what the average percentage error is for each model.

R² and adjusted R² keep increasing with model complexity, and the error rate decreases as well. After Model 9, however, the error rate increases slightly.

Let’s compare the actual figures across all the models to see which model to finalize.

ID Model rss rmse rmse_loocv adj_r2
1 Polynomial Degree 2 100.99 10.049 0.3980 0.7924
2 All Interaction Except Sex:Smoker 98.15 9.907 0.3948 0.7973
3 2 way 98.14 9.907 0.3964 0.7970
4 3 way 93.89 9.690 0.3907 0.8046
5 Polynomial Degree 2 + All Interaction Except Sex:Smoker 98.15 9.907 0.3948 0.7973
6 AIC - Polynomial Degree 2 + All Interaction Except Sex:Smoker 98.24 9.912 0.3933 0.7977
7 BIC - Polynomial Degree 2 + All Interaction Except Sex:Smoker 99.94 9.997 0.3949 0.7949
8 Polynomial Degree 2 + 2 way 98.14 9.907 0.3964 0.7970
9 AIC - Polynomial Degree 2 + 2 way 98.24 9.912 0.3933 0.7977
10 BIC- Polynomial Degree 2 + 2 way 99.94 9.997 0.3949 0.7949
11 Polynomial Degree 2 + 3 way 93.89 9.690 0.3907 0.8046
12 AIC - Polynomial Degree 2 + 3 way 93.98 9.694 0.3892 0.8050
13 BIC - Polynomial Degree 2 + 3 way 93.98 9.694 0.3892 0.8050
14 Big Model 93.05 9.646 0.3925 0.8051
15 AIC - Big Model 93.29 9.659 0.3897 0.8058
16 BIC - Big Model 93.98 9.694 0.3892 0.8050

The best of the 16 models above is Model 12, with the lowest RMSE LOOCV of 0.3892, even though it does not have the lowest RSS or RMSE. Let’s check the linearity assumptions of this model.

Diagnostic Plot - Best Model :

Two assumptions need to be tested in this section:
- Constant variance
- Normality

Let’s plot the residual versus Fitted plot and Q-Q plot of the model

par(mfrow = c(1, 2), oma = c(0, 0, 2, 0))

plot(resid(poly2_int_model10_aic)~fitted(poly2_int_model10_aic),
     pch = 1,
     cex = 1,
     cex.main = 0.8,
     col = "darkgrey",
     xlab = "Fitted Values",
     ylab = "Residuals",
     main = "Fitted Vs Residual - Polynomial Degree 2 + 3 Way Interaction Model")
abline (h = 0, col = "darkorange", lwd = 2, lty = 1)

qqnorm(resid(poly2_int_model10_aic), 
       pch = 1,
       cex = 1,
       cex.main = 0.8,
       col = "darkgrey")
qqline(resid(poly2_int_model10_aic),
       lwd = 1,
       col = "red",
       lty = 2)

Both constant variance and normality look suspect. The residual plot shows small variance for low fitted values, with the variance increasing as the fitted values increase. The Q-Q plot appears to have fat tails.

Let’s run the Breusch-Pagan (BP) test and the Shapiro-Wilk test to validate this.

library(lmtest) # provides bptest()
bptest(poly2_int_model10_aic)
## 
##  studentized Breusch-Pagan test
## 
## data:  poly2_int_model10_aic
## BP = 88, df = 12, p-value = 1e-13
shapiro.test(resid(poly2_int_model10_aic))
## 
##  Shapiro-Wilk normality test
## 
## data:  resid(poly2_int_model10_aic)
## W = 0.99, p-value = 0.002

Both p-values are small, which confirms that constant variance and normality are suspect.

Outlier Diagnostic - All Models:

Let’s check for any influence points in the model

Large Leverage

head(hatvalues(poly2_int_model10_aic)[hatvalues(poly2_int_model10_aic) > 2 * mean(hatvalues(poly2_int_model10_aic))],10)
##       2      23      26      59      64     104     118     191     216     222 
## 0.04065 0.05486 0.05998 0.04095 0.04734 0.06494 0.04812 0.44784 0.05486 0.10747

There are around 84 observations with high leverage (first 10 shown above).

Let’s check for high standardized residuals.

Large Standard Residuals

head(rstandard(poly2_int_model10_aic)[abs(rstandard(poly2_int_model10_aic)) > 2],10)
##      2     53    325    332    333    368    372    422    425    442 
## -3.282  2.483  2.251 -2.127 -2.073 -2.230  2.469  2.928 -2.018  2.530

There are around 30 observations with high standardized residuals (first 10 shown above).

Even though the model has high-leverage points and large standardized residuals, let’s check whether these observations are actually influential.

This can be verified with Cook’s distance.

n = length(resid(poly2_int_model10_aic))
length(cooks.distance(poly2_int_model10_aic)[cooks.distance(poly2_int_model10_aic) > (4/n)])
## [1] 41

It has been observed that there are around 41 influential observations in the model (using the heuristic Cook’s distance > 4/n).

Re-fit the model after removing these influential points

Now let’s re-fit the model after removing the influential observations, to see whether the model improves.

model_cd = cooks.distance(poly2_int_model10_aic)
poly2_int_model10 = lm(FEV ~ (Age + Height + Sex + Smoker) ^ 3, data = childfev, subset = model_cd <= (4/n)) # Polynomial degree 2 + 3 Way Interaction
poly2_int_model10_aic = step(poly2_int_model10, direction = "backward", trace = 0)

The model re-fit is now completed. Now let’s draw the diagnostic plot once again

par(mfrow = c(1, 2), oma = c(0, 0, 2, 0))

plot(resid(poly2_int_model10_aic)~fitted(poly2_int_model10_aic),
     pch = 1,
     cex = 1,
     cex.main = 0.8,
     col = "darkgrey",
     xlab = "Fitted Values",
     ylab = "Residuals",
     main = "Fitted Vs Residual - Polynomial Degree 2 + 3 Way Interaction Model")
abline (h = 0, col = "darkorange", lwd = 2, lty = 1)

qqnorm(resid(poly2_int_model10_aic), 
       pch = 1,
       cex = 1,
       cex.main = 0.8,
       col = "darkgrey")
qqline(resid(poly2_int_model10_aic),
       lwd = 1,
       col = "red",
       lty = 2)

The plots look much better than before. Constant variance still seems suspect, as the variance grows from low to high as the fitted values increase. The Q-Q plot, however, looks good, so normality is no longer suspect.

Let’s run the BP test and the Shapiro-Wilk test to confirm.

bptest(poly2_int_model10_aic)
## 
##  studentized Breusch-Pagan test
## 
## data:  poly2_int_model10_aic
## BP = 75, df = 12, p-value = 3e-11
shapiro.test(resid(poly2_int_model10_aic))
## 
##  Shapiro-Wilk normality test
## 
## data:  resid(poly2_int_model10_aic)
## W = 1, p-value = 0.6

As the plots suggested, the p-value of the BP test is very low, confirming that constant variance is suspect, whereas the p-value from the Shapiro-Wilk test is high (greater than any usual alpha, e.g. 0.05 or 0.1), confirming that normality is not suspect.

Discussion

Discussion of why we have chosen Model 12

In the end we evaluated around 16 models. Out of these, we selected Model 12 as our best model based solely on the fact that it has the lowest RMSE under Leave-One-Out Cross-Validation (LOOCV). This model results from backward AIC selection applied to the model with polynomial regression of degree 2 plus 3-way interactions between the predictors.

Below is a re-drawn plot of the RSS, RMSE and RMSE LOOCV, with point 12 marking the chosen model.

Model 12’s RMSE LOOCV is the lowest, while its RSS and RMSE are not. As we know, RSS and RMSE always decrease as model complexity increases, so they would keep falling for models even more complex than Model 12. RMSE LOOCV, however, is a better way to validate model performance, and by that measure this is a good model for predicting FEV.

Below is a replot of R² and adjusted R², with a circle marking Model 12.

Percent of Error

Let’s discuss the percentage error produced by Model 12. But first, let’s create a function that calculates the percent error from the formula below.

  • Calculate the average percent error: \[ \frac{1}{n}\sum_i\frac{|\text{predicted}_i - \text{actual}_i|}{\text{predicted}_i} \times 100 \]
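A minimal sketch of such a function (`avg_pct_error` is a hypothetical name, not necessarily the one used in this report's code), matching the formula above:

```r
# Average percent error per the formula above. Note the denominator is the
# predicted value, following this report's definition (not the actual value).
avg_pct_error = function(actual, predicted) {
  mean(abs(predicted - actual) / predicted) * 100
}

# Tiny worked example: errors of 50% and 0% average to 25%.
avg_pct_error(actual = c(1.0, 2.0), predicted = c(2.0, 2.0))  # 25
```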

We can see that the percent error for Model 12 is around train_test_error_final_model[12], which is not bad, though we used random sampling with a fixed seed. With repeated sampling the percentage error could well be lower. However, since Model 12 has the lowest RMSE LOOCV, we finalize Model 12 as the final model.